Introduction. Despite the popularity of crowdsourcing, the reliability of crowdsourced 
output has been questioned since crowdsourced workers display varied degrees of 
attention, ability and accuracy. It is important, therefore, to understand the factors that 
affect the reliability of crowdsourcing. In the context of producing relevance judgments, 
crowdsourcing has been recently proposed as an alternative approach to traditional 
methods of information retrieval evaluation, which are mostly expensive and scale 
poorly. 

Aim. The purpose of this study is to measure various cognitive characteristics of 
crowdsourced workers, and explore the effect that these characteristics have upon 
judgment reliability, as measured against a human gold standard. 

Method. The authors examined whether workers with high verbal comprehension skill 
could outperform workers with low verbal comprehension skill in terms of judgment 
reliability in crowdsourcing. 

Results. A significant correlation was found between judgment reliability and measured 
verbal comprehension skill, as well as with self-reported difficulty of judgment and 
confidence in the task. Surprisingly, however, there is no correlation between level of
self-reported topic knowledge and reliability.

Conclusions. Our findings show that verbal comprehension skill influences the 
accuracy of the relevance judgments created by the crowdsourced workers. 


Introduction 

The term crowdsourcing was coined by Howe (2006) in a Wired
Magazine article, and is based on Web 2.0 technology. Crowdsourcing is
defined as outsourcing tasks, which were formerly accomplished inside a 
company or institution by employees, to a huge, heterogeneous mass of 
potential workers in the form of an open call through the Internet. 
Crowdsourced workers (henceforth workers) are hired through online 
Web services such as Amazon Mechanical Turk, and work online to 
perform repetitive cognitive piece-work (known as tasks) at low cost, with 
many workers potentially working in parallel to quickly complete a task. 
Tasks are referred to as human intelligence tasks. Crowdsourcing can be 
applied widely in various fields of computer science and other disciplines 
to test and evaluate studies (Zhao and Zhu, 2012). Crowdsourcing
platforms were also suggested for the purposes of collecting survey data
for behavioural research as a viable choice (Behrend, Sharek, Meade and
Wiebe, 2011). Mason et al. (2012) have stated three advantages of
crowdsourcing platforms: (i) a large number of workers can take part in
experiments for low payment; (ii) workers come from diverse countries,
cultures and backgrounds, are of different ages and speak different
languages; and (iii) the research can be carried out at low cost.

Recently, the use of crowdsourcing in information retrieval evaluation has 
attracted interest (Alonso and Mizzaro, 2012). Test collections are
frequently used to evaluate information retrieval systems, in both
laboratory experimentation and product development. This is sometimes
referred to as the 'Cranfield paradigm', in reference to the original
research on laboratory retrieval evaluation conducted by Cyril Cleverdon
at Cranfield University (Cleverdon, 1967). The Text REtrieval Conference
(TREC) was established in 1992 in order to support information retrieval 
research by providing the infrastructure for large scale evaluation of 
retrieval methodologies. A test collection consists of a set of documents or 
corpus, a set of topics or search requests and relevance judgments as to 
which documents are relevant to which topics. Relevance judgments are 
made by human assessment (unless pseudo-judgments are inferred from 
user click behaviour), and hiring expert assessors to perform these 
judgments is expensive and time-consuming. As the size and diversity of 
test collections have grown, this expense has become increasingly 
burdensome. Crowdsourcing has been proposed as an economically 
viable alternative to collect relevance judgments in order to overcome this 
issue, mainly due to its low cost and fast turnaround (Alonso, Rose and
Stewart, 2008). However, the quality and reliability of crowdsourcing have
been questioned for several reasons, such as:

• Workers may have inadequate expertise for the task at hand (Quinn
and Bederson, 2011).

• Demographic and personality traits of workers may be different and
unrecognized, which can affect the quality of crowdsourced
relevance judgments (Kazai, Kamps and Milic-Frayling, 2012).

• The quality of the final relevance judgments is highly dependent on
the workers' level of interest, incentive and attention for a given task
(Kazai, Kamps and Milic-Frayling, 2011).

A range of quality assurance and control techniques have been used to
reduce the noise (poor quality outputs) that is produced during or after the
completion of a given task. However, little is known about the workers
themselves and the role of individual differences in the reliability of 
crowdsourced relevance judgments. Individual differences in cognitive 
performance are defined as cognitive abilities. These abilities are mainly 
brain-based skills, concerning learning, remembering, problem-solving,
attention and mindfulness (Ekstrom, French, Harman and Dermen,
1976). The objective of this study was to assess the human factors that
influence the quality of crowdsourcing output and more specifically
crowdsourced relevance judgments. We investigated the linguistic and
cognitive capacities of workers through tests and questionnaires. The
cognitive ability that we tested was verbal comprehension skill; that is,
the ability to understand the English language, believed to be one of the
key factors influencing information retrieval behaviour (Allen, 1992),
and potentially important also in judging the relevance of a document.

We also tested the relationship between the reliability of the relevance 
judgments on the one hand, and self-reported difficulty of the task, 
confidence of the worker, and worker’s knowledge of the topic on the 
other. We hypothesized that more reliable judgments are produced by
workers who declare the given tasks to be easy, show higher
confidence in their judgment and report themselves knowledgeable
about the topics. The reliability of the workers was compared to that of an
expert assessor, both directly as the overlap between relevance 
assessments, and indirectly by comparing the system effectiveness 
evaluation arrived at from expert and from worker assessors. Specifically, 
this study addressed the following research questions: 

R1: Does the verbal comprehension skill of a worker have an effect on the
quality of crowdsourcing output, specifically, on their relevance 
judgments? 

R2: How do a worker’s (i) topic knowledge, (ii) perceived difficulty of the 
task, and (iii) confidence in correctness, relate to the accuracy of the 
worker’s relevance judgments? 

We begin by surveying factors influencing the reliability of crowdsourcing 
output. Recent studies on the use of crowdsourcing to create relevance 
judgments for the evaluation of information retrieval systems are 
reviewed, followed by an overview of the role of cognitive abilities in the 
information retrieval process. Next, we explain our research 
methodology, design and dataset. Finally, we discuss the results of the 
experiment, and we draw our conclusions. 

Background 

Factors that affect the reliability of crowdsourcing output

Crowdsourcing suffers from low quality output due to various types of 
workers’ behaviour (Zhu and Carterette, 2010). In order to reduce the
impact of malicious workers and improve the quality of crowdsourced
output, it is useful to categorize workers according to their accuracy when
performing the outsourced tasks, e.g., elite workers (who accomplish
tasks with accuracy of 100%), competent workers, incompetent workers
and so forth (Gadiraju, Kawase, Dietze and Demartini, 2015).

There are different factors that affect the reliability of crowdsourcing 
experiments such as experimental design, human features, and monetary 
factors. Experimental design is the most critical part of the crowdsourcing 
process (Alonso, 2012). Beyond the workers’ levels of attention, diversity
of cultures and variations in preferences and skills, the presentation and
properties of human intelligence tasks are the key factors for the quality
of crowdsourcing. Indeed, the quality of the user interface, the
instructions and the design of the crowdsourcing process have a direct
relationship with the quality of the task performed by a worker. In the
experimental design, the first information that needs to be presented to
the workers is the definition of the given task. Task description is part of
task preparation and is an important topic in implementing a
crowdsourcing experiment. A task description with clear instructions
is crucial to getting a quick result. Ideally, all of the workers should have
a common understanding about a chosen task, and the task must be
understandable in the language of the workers (Alonso, 2012). Task
description should be prepared according to the variation in the general
characteristics of workers such as their language and/or the level of their
expertise in the field (Allahbakhsh et al., 2013).

Human features of a worker define a worker profile, which consists of 
one’s reputation and expertise (credentials and experience) in 
accomplishment of tasks. A worker profile has a significant influence on 
the quality of results. Requesters may provide feedback about the quality 
of the particular work to a worker. Feedback scores are used in the system 
to determine the reputation of a worker (De Alfaro, Kulshreshtha, Pye
and Adler, 2011). Reciprocally, requesters need to enhance their
reputations in order to increase the probability that their human
intelligence tasks will be accepted by workers (Paolacci, Chandler and
Ipeirotis, 2010). Information such as language, location and academic
degree builds credentials, but the knowledge that a worker achieves
through the crowdsourcing system is referred to as experience
(Allahbakhsh et al., 2013).

In crowdsourcing, monetary factors such as payment affect the accuracy
of results. Workers satisfied by the payment accomplish tasks more
accurately than those who are left unsatisfied (Kazai, Kamps and
Milic-Frayling, 2013). Monetary and/or non-monetary reasons can be the
motivation for the workers of crowdsourcing platforms (Hammon and
Hippner, 2012). A study conducted by Ross et al. showed that financial
gain is a main incentive for workers in crowdsourcing (Ross, Irani,
Silberman, Zaldivar and Tomlinson, 2010). Ipeirotis (2010) reported that
Amazon Mechanical Turk was the main income of 27% of Indian and 12%
of US workers. Kazai (2011) reported that increasing the payment
enhances the quality of work, whilst there is some evidence that higher
payment has an effect only on completion time rather than on quality of
results (Potthast, Stein, Barron-Cedeno and Rosso, 2010), and that a high
level of payment motivates a worker to perform a task faster but not
necessarily with better quality. Reasonable payment appears to be a more
thoughtful and conservative solution, as high pay tasks attract spammers
as well as legitimate workers (Grady and Lease, 2010).

Crowdsourcing in information retrieval evaluation 

In 2011, Kazai et al. investigated the relationship between workers’ 
behavioural patterns, the accuracy of their judgments and their 
personality profiles (based on the Big Five personality traits; John,
Naumann and Soto, 2008). They found a strong correlation between the
accuracy of judgments and the openness trait. Five types of workers
(spammer, sloppy, incompetent, competent, and diligent) were identified,
based on their behavioural patterns (Zhu and Carterette, 2010). In 2012,
Kazai et al. studied the relationship between demographics, the 
personality of workers and label accuracy. They used two different task 
designs, namely full design, which has a strict quality control, and simple 
design, with less quality control. The results showed that the 
demographics and personality of the workers were strongly related to 
label accuracy. Among demographic factors, location had the strongest 
relationship with label accuracy, with the lowest accuracy from Asian 
workers and the highest accuracy from American and European workers. 
Asian workers were more likely to undertake the simple design, while 
American and European workers were more likely to undertake the full 
design, although the difference may have been an artifact of the
pre-filtering process that happened in full design (in which workers without a
sufficient reputation score were filtered out; Kazai et al., 2012). In
another study, the effect of the level of pay, effort to complete tasks and 
qualifications needed to undertake tasks on the quality of the labels was 
investigated, and correlated with various human factors. The study found 
that higher payment leads to better output quality among qualified 
workers, but also attracts workers that are less ethical. Higher effort tasks 
lead to labels that are more inaccurate, while enticing better performing 
workers. Limiting access to tasks to reliable workers increases the quality 
of the results. Earning money is the main motivation for workers to do 
the tasks (Kazai et al., 2012).

Alonso et al. (2008) ran five preliminary experiments with different 
alternatives, such as qualification tests and changing interface, through 
Amazon Mechanical Turk using TREC data and measured the agreement 
between crowdsourced workers and TREC assessors. The findings 
showed that the judgments of crowdsourced workers were comparable to 
the TREC assessors. In some cases, the workers detected TREC assessors’ 
errors. In Alonso and Mizzaro (2012), the use of crowdsourcing for 
creating relevance judgments was validated through a comprehensive 
experiment. The experimental results show that crowdsourcing is a low 
cost, reliable and quick solution, and an alternative to creating relevance 
judgments by expert assessors, but it is not a replacement for current 
methods because there are still several gaps and questions that are left for 
future research. For example, the scalability of this approach has not been 
investigated yet. Blanco et al. (2011) investigated the repeatability of 
crowdsourced evaluation. The results show that crowdsourcing 
experiments can be repeated over time in a reliable manner. Although 
there were differences between human expert judgments and 
crowdsourced judgments, the system ranking was the same. Clough et al. 
(2012) compared the reliability of crowdsourced and expert judgments 
when used in information retrieval evaluation. They evaluated two search 
engines on informational and navigational queries, using crowdsourced 
and expert judgments. The study found the crowdsourced judgments 
comparable to expert judgments, with a strong positive correlation 
between search effectiveness measured by each class of judgments.
Disagreements between expert and crowdsourced judgments were more
common on documents returned by the better performing system and on
documents returned for informational queries.

Cognitive abilities in the information retrieval process

This study was motivated by the theory of the information retrieval 
process, which suggests cognitive abilities most likely influence 
information retrieval effectiveness. It explored the effect of cognitive 
abilities of workers on reliability of relevance judgments. We 
hypothesized that the same relationship would pertain to relevance 
assessment, as understanding the content of documents and topics in the 
relevance judgment task requires reading, understanding text and 
evaluating its relevance. We thought it possible that people with a higher
level of cognitive ability would be more likely to create more accurate
relevance judgments. This idea is derived from Allen (1992), who
demonstrated that cognitive abilities influence the information retrieval
process. Recently, Brennan et al. claimed that information search is
principally about cognitive activities (Brennan, Kelly and Arguello, 2014).
Therefore, understanding the effect of cognitive abilities on search
behaviour is an important research topic. One of the popular instruments
to assess cognitive abilities is the Kit of Factor-Referenced Cognitive
Tests, produced by the US-based Educational Testing Service (Ekstrom et
al., 1976). This kit contains seventy-two tests to measure twenty-three
different cognitive factors. The kit is still widely used in various areas of
research (Geary, Hoard, Nugent and Bailey, 2012; Beaty, Silvia,
Nusbaum, Jauk and Benedek, 2014; Salthouse, 2014).

In the area of information retrieval, the effect of perceptual speed, logical
reasoning, spatial scanning and verbal comprehension skills on the
suitability of academic librarians for their jobs and on their performance
in searching was investigated by Allen and Allen (1993). Furthermore, the
cognitive abilities of librarians and students were compared. The results
of this study showed that students had a higher level of perceptual speed
and librarians had a higher level of logical reasoning and verbal
comprehension skills. Cognitive abilities have an effect on information
retrieval performance, and therefore, different approaches to information
retrieval may be suitable for librarians and students. In a more recent
study (Brennan et al., 2014), the effects of cognitive abilities on search
behaviour were investigated during search tasks, measuring visualization
ability, perceptual speed and memory. The findings of this study showed
that among these three cognitive abilities, both perceptual speed and
visualization ability had a higher positive correlation with search
behaviour than memory. In a study about search effectiveness of users
applying a TREC test collection, the effect of characteristics of users (for
instance, whether they have some prior search experience) and their
levels of cognitive ability was assessed (Al Maskari and Sanderson, 2011).
Those users with higher perceptual skills and prior search experience
demonstrated a better search effectiveness when compared with users
with less experience and lower perceptual abilities.


Need for cognition defines an individual difference measure of 'the extent
to which a person enjoys engaging in effortful cognitive activity' (Scholer,
Kelly, Wu, Lee and Webber, 2013, p. 624). A study of the impact of need
for cognition on relevance assessments showed that the participants with
high need for cognition had a significantly higher level of agreement with
expert assessors in terms of relevance assessment than low need for
cognition participants. We consequently expected that
crowdsourced workers with high cognitive abilities would be likely to
create more reliable relevance judgments. We predicted that information
retrieval practitioners, in choosing individuals to create relevance
judgments, would be likely to select workers with a higher level of cognitive
abilities. To the authors’ knowledge, however, no studies have
investigated the cognitive abilities of crowdsourced workers and their
effect on the workers’ reliability in judging the relevance of documents.

Method 

Experiment data 

Eight topics were chosen from the TREC-9 Web Track 
(http://trec.nist.gov/data/t9.web.html), and twenty documents were
randomly obtained for each topic from the WT10g document collection
(http://ir.dcs.gla.ac.uk/test_collections/wt10g.html). All documents and
topics were in English. According to the original TREC assessors, of the
twenty chosen documents, ten were relevant and ten were non-relevant.
In the standard TREC setting, by contrast, assessors creating a relevance
judgment set are presented with many more non-relevant documents than
relevant ones. The reason for
selecting an equal number of relevant and non-relevant documents in this
study was that a long sequence of irrelevant documents might cause an
assessor to lower their threshold of relevance, or instead to lose attention
and miss relevant documents (Scholer et al., 2013). For each of the 160
topic-document pairs, ten binary judgments were obtained through
crowdsourcing (each one from a different worker), for a total of 1600
judgments made by workers. The number of workers who
performed the tasks was 154. In this study, the gold standard dataset was
the relevance judgments set created by the official TREC assessors, to
which relevance judgments made by the crowdsourced workers were
compared.

Task design 

In this study, forty tasks were designed in Crowdflower, a popular
crowdsourcing platform. Each task was to be completed by 10 workers
and required two steps to be completed. In the first step, each task
presented four topics, each paired with one document to be judged for
relevance against that topic. Upon completing each judgment, the workers
were required to complete a questionnaire. The questionnaire consisted of
the following three items, to be answered on a 4-point scale:

Question 1) Rate your knowledge on the topic: (Minimal 1 2 3 4
Extensive).

Question 2) How difficult was this evaluation: (Easy 1 2 3 4 Difficult).

Question 3) How confident were you in your evaluation: (Not confident
1 2 3 4 Very confident).

The second step was to examine the workers’ verbal comprehension skills. 
In this step, the workers were asked to complete a vocabulary test of ten 
out of thirty-six questions. These questions were randomly sampled from 
the suite of evaluation exercises known as the Kit of Factor-Referenced
Cognitive Tests (Ekstrom et al., 1976). The workers were required to
choose one of four words that had a similar meaning to the given word. 
The verbal comprehension score was then calculated on the basis of the 
overall vocabulary task. 

In this study, the relevance judgment task setting was different from the 
classic judgment method by TREC assessors in which a TREC assessor 
judges all documents from the same topic. The reason why we had four 
topics and documents in each task was the use of the crowdsourcing 
platform in our experiment. Crowdsourcing tasks are commonly short.
For a long and complex task, the suggestion is to split the task into a few
simpler tasks, because they may attract more workers to complete the
experiment (Alonso, 2012). In addition, because we were using
crowdsourcing, we had to limit the number of questions for the
vocabulary test to ten (out of thirty-six) questions. We also assessed
whether the ten question test could provide acceptable
outcomes to evaluate the verbal comprehension skill of workers. The latter
was assessed by comparing the outcomes with those of the full thirty-six
question version of the test. Two verbal comprehension scores were
calculated for each worker: one for the thirty-six question test and
another for the ten question test. According to the median split of the
comprehension score, the workers were categorized into two groups,
namely low verbal comprehension score and high verbal comprehension
score, under each of the two scores. The kappa (κ) measure of agreement
(Pallant, 2001) was then calculated to find out whether the groupings from
the ten question test and the thirty-six question test agreed, showing a
strong agreement (κ = 0.70) between the two tests. Therefore, in this
study, the use of the ten question
test could work around the limitations posed by crowdsourcing, and could
provide a statistically meaningful tool to assess workers’ verbal
comprehension skill.

Filtering spam 

Crowdsourcing is subject to untrustworthy workers, who complete tasks 
fast but carelessly (with least effort), just to earn the money. Filtering
such workers is a common quality control procedure in crowdsourcing
(Kazai et al., 2013). As the vocabulary test is a multiple-choice test with
four choices per question and ten questions, a worker selecting at random
has an expected score of 2.5. Put another way, a worker selecting at
random has less than a one in four chance of achieving a score of 4 or
higher. In our experiment, workers completed the vocabulary test for
each task they accepted. The filtering method was based on the score of
the vocabulary test achieved by a worker for each task. Those tasks in
which workers achieved verbal comprehension scores of 3 or less were
considered unreliable: either they were spammers or workers with no
English language ability. The intention behind considering tasks rather
than workers in the filtering process was that a worker might accomplish
different tasks with various levels of accuracy. In other words, a worker
might accomplish one task carefully and another task hastily.
Applying the filtering technique in this study, there were eighty-one
unreliable tasks out of 400 tasks. Therefore, of the 1600 judgments
submitted, we could only consider 1276 judgments as reliable,
contributed by 147 workers (out of 154).
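
The chance-level figures quoted for the vocabulary test can be checked with a short calculation. The sketch below assumes that a spammer answers each of the ten four-choice questions independently and uniformly at random, so that the score follows a binomial distribution; the function name is illustrative and not part of the original study.

```python
from math import comb


def prob_score_at_least(threshold, n_questions=10, p_correct=0.25):
    """P(X >= threshold) for X ~ Binomial(n_questions, p_correct)."""
    return sum(
        comb(n_questions, k) * p_correct**k * (1 - p_correct) ** (n_questions - k)
        for k in range(threshold, n_questions + 1)
    )


# Mean of the binomial distribution: questions times chance of a correct guess.
expected_score = 10 * 0.25

# Probability that a pure guesser reaches the filter threshold of 4.
p_pass = prob_score_at_least(4)

print(expected_score, round(p_pass, 3))
```

Under these assumptions the expected score is 2.5 and the probability of scoring 4 or higher is about 0.22, which agrees with the statement that it is below one in four.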

Analysis methods 

Reliability of relevance judgments is measured as the agreement between 
the worker and the TREC expert assessor (gold standard). However, 
relevance judgments are subjective and can vary among assessors (Kazai
et al., 2012). For instance, the agreement between two TREC assessors
was reported to be 70% to 80% on average (Voorhees and Harman, 2005). In
this study, the agreement in terms of relevance judgments was evaluated
based on two different methods: (i) percentage agreement, which is the
simplest measure, calculated by dividing the number of judgments on
which the two assessors assign the same rating by the total
number of ratings, and (ii) Cohen's kappa, which adjusts the observed
agreement for the probability of chance agreement.
Qualitative interpretation of Cohen’s kappa is established through a
five-level scale proposed by Landis and Koch (1977). The five-level scale
consists of slight agreement (0.01-0.20), fair agreement (0.21-0.40),
moderate agreement (0.41-0.60), substantial agreement (0.61-0.80) and
almost perfect agreement (0.81-0.99).
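
As an illustration of the two agreement measures, the following minimal sketch computes percentage agreement and Cohen's kappa for two lists of binary labels and maps kappa to the Landis and Koch bands listed above. The example labels are invented, and the handling of kappa values at or below zero is an assumption, since the scale in the text starts at 0.01.

```python
from collections import Counter


def percentage_agreement(a, b):
    """Share of items on which the two assessors give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Observed agreement corrected for agreement expected by chance."""
    n = len(a)
    p_observed = percentage_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two assessors' marginal label rates.
    p_expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch(kappa):
    """Qualitative band for a kappa value (Landis and Koch scale)."""
    if kappa <= 0:
        return "no agreement beyond chance"  # assumption: not in the five-level scale
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical labels: 1 = relevant, 0 = not relevant.
worker = [1, 1, 0, 0, 1, 0, 1, 0]
gold = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(worker, gold)
print(percentage_agreement(worker, gold), kappa, landis_koch(kappa))
```

In this toy case the assessors agree on 75% of items, but because half of that agreement would be expected by chance, kappa is only 0.5. Note that the sketch does not guard against the degenerate case in which expected agreement equals 1.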

Accuracy of the relevance judgments is the proportion of judgments on
which the worker and the gold standard (i.e. TREC assessors in our
study) agree (Kazai et al., 2012). Accuracy ranges from 0 (no agreement) to 1
(complete agreement). Accuracy can be measured over the number of
documents included in a single human intelligence task (in our case there
were four documents per task):


\[
\text{Accuracy} = \frac{\sum \text{Correct judgments}}{\sum \text{Judgments}} \tag{1}
\]
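
Equation (1) can be sketched for a single task as follows. The labels are invented for illustration, and counting any label that differs from the gold standard (including a "Don't know" response) as incorrect is an assumption rather than something stated in the text.

```python
def accuracy(worker_judgments, gold_judgments):
    """Proportion of judgments on which the worker matches the gold standard."""
    if len(worker_judgments) != len(gold_judgments):
        raise ValueError("both lists must cover the same documents")
    correct = sum(w == g for w, g in zip(worker_judgments, gold_judgments))
    return correct / len(gold_judgments)


# One hypothetical task with four documents: 1 = relevant, 0 = not relevant.
task_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(task_accuracy)
```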


We used Pearson’s correlation coefficient to measure the relationship
between two real-valued user factors (for instance, verbal comprehension
score and accuracy). A correlation of 1 means a perfect positive linear
correlation (i.e. the factors form an upward-sloping straight line if plotted
on a graph); a correlation of 0 means there is no correlation (as would
occur if the factors were independent); and a correlation of -1 means
perfect negative linear correlation (a downward-sloping straight line).
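
The three reference values of the coefficient can be reproduced with a small, dependency-free sketch. The data points are invented solely to produce correlations of 1, -1 and 0.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


x = [1, 2, 3, 4]
r_up = pearson_r(x, [2, 4, 6, 8])      # upward-sloping straight line
r_down = pearson_r(x, [8, 6, 4, 2])    # downward-sloping straight line
r_none = pearson_r(x, [1, -1, -1, 1])  # no linear relationship
print(r_up, r_down, r_none)
```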

The agreement between effectiveness evaluation scores over two sets of 
systems can be measured by Kendall’s tau (Kendall, 1938), which is a
standard procedure in information retrieval evaluation (Scholer, Turpin
and Sanderson, 2011). Kendall’s tau measures the agreement in the
ranking between two sets of paired values. The motivation for its use in
information retrieval evaluation is that we primarily care about whether
one system is better than another, but do not place much importance on
the precise value of the effectiveness metric. We used Kendall’s tau to
measure the agreement between the system rankings produced using
worker and gold standard judgments. In this study, the
independent-samples t-test, a parametric significance test, was used to
compare two
independent groups. The chi-squared test for independence was used to
explore relationships between categorical variables. This test compares
the observed proportion of cases that occur in each of the categories, and
tests the null hypothesis that the population proportions are identical
(Soboroff, Nicholas and Cahan, 2001).
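
To make the rank comparison concrete, here is a minimal sketch of Kendall's tau in its simplest form (tau-a, which ignores ties). Which variant the study used is not stated, so this choice is an assumption, and the effectiveness scores for the four systems are invented.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1    # both score sets order this pair the same way
        elif direction < 0:
            discordant += 1    # the two score sets disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores for four systems under each judgment set.
gold_scores = [0.30, 0.25, 0.20, 0.10]
worker_scores = [0.28, 0.29, 0.15, 0.12]
tau = kendall_tau_a(gold_scores, worker_scores)
print(round(tau, 3))
```

Only the first two systems swap places between the two score sets, so five of the six pairs are concordant and tau is 4/6. The exact score values do not matter, only their order, which is why the measure suits system ranking comparisons.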

Results and discussion 

The effects of workers' level of verbal comprehension skill on reliability of judgments

Workers were divided into two groups based on their verbal 
comprehension scores, the high values above the median and the low 
values below the median. A median split is one method for turning a 
continuous variable into a categorical one (Reis and Judd, 2000).

• Group 1 - low score: verbal comprehension score between 4 and 8, 
consisting of 156 tasks. 

• Group 2 - high score: verbal comprehension score between 9 and 
10, consisting of 163 tasks. 

(As described previously, tasks with a verbal comprehension score below 
4 were filtered out as spammers or as having no English language 
competence). 
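
The filtering and grouping steps can be sketched as below. The text does not say how scores exactly equal to the median were assigned, so placing them in the high group is an assumption, and the list of per-task scores is invented for illustration.

```python
from statistics import median


def split_tasks_by_score(scores, min_reliable=4):
    """Drop scores below the spam threshold, then split at the median."""
    reliable = [s for s in scores if s >= min_reliable]
    cut = median(reliable)
    low = [s for s in reliable if s < cut]
    high = [s for s in reliable if s >= cut]  # assumption: ties go to the high group
    return low, high, cut


# Hypothetical verbal comprehension scores (0-10), one per completed task.
scores = [2, 3, 4, 6, 8, 9, 9, 10, 10, 7]
low, high, cut = split_tasks_by_score(scores)
print(cut, sorted(low), sorted(high))
```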

Agreement between Group 1 and TREC assessors is 61.94% (35.93% on
relevant and 26.01% on not relevant). The level of disagreement between
them is 34.2%, while 3.7% of workers chose “Don’t know”. The level of
agreement between Group 2 and TREC assessors is 75.9% (32.7% on
relevant and 43.1% on not relevant), which is higher than that observed
for Group 1. The disagreement between Group 2 and TREC assessors and
the percentage of those workers who chose “Don’t know” are 21.4% and
2.6%, respectively.

Cohen’s kappa agreement between the relevance judgments of
crowdsourced workers and TREC assessors is 0.3 (fair agreement) for
Group 1 and 0.57 (moderate agreement) for Group 2. Group 2 is thus more
reliable in its judgments than Group 1.
Consistent with the results of a previous study (Alonso and Mizzaro,
2012), neither of the groups showed a strong agreement when compared
with the gold standard. That study showed an agreement level of 68%
between relevance judgments created by workers and relevance
judgments provided by TREC assessors, which is a fair agreement. In
another study (Al Maskari, Sanderson and Clough, 2008), the agreement
between relevance judgments of TREC and non-TREC assessors
(recruited to perform a search task) for fifty-six topics showed a moderate
agreement.

Pearson’s correlation between verbal comprehension score and accuracy
is 0.32 (p < 0.001). The verbal comprehension score shows a moderate
but significant correlation with accuracy. Dividing workers into low and
high score groups, we again see that higher verbal comprehension skill
leads to higher accuracy (Figure 1), and there are significant differences in
accuracy (t(317) = -5.20, p < 0.001) between Group 1 (μ = 0.62, σ = 0.25)
and Group 2 (μ = 0.76, σ = 0.21).



Figure 1: Accuracy in relation to verbal comprehension score. 
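
The group comparison can be approximated from summary statistics alone. The sketch below assumes a pooled-variance (Student's) independent-samples t-test and uses the rounded means, standard deviations and group sizes given in the text, so its result need not match the reported statistic exactly.

```python
from math import sqrt


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    standard_error = sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Group 1: 156 tasks; Group 2: 163 tasks (values as reported, rounded).
t_stat, df = pooled_t(0.62, 0.25, 156, 0.76, 0.21, 163)
print(round(t_stat, 2), df)
```

The degrees of freedom come out as 317, as reported, and the statistic is of similar magnitude to the reported -5.20; the remaining gap is presumably due to the rounding of the inputs.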

The effects of workers' level of verbal comprehension skill on system rankings

The influence of crowdsourced judgments on system rankings was
assessed to find out whether crowdsourced judgments are reliable for
evaluation purposes. One set of relevance judgments was generated from
Group 1, and another set from Group 2. Multiple assessors assessed the
documents, and majority voting was used to aggregate the
judgments. Each relevance judgment set consisted of 160 relevance
judgments. The information retrieval systems that participated in the
TREC-9 Web Track were then scored using mean average precision
(MAP) and ranked, first using the Group 1 judgments and then using the
Group 2 judgments. Each of these rankings was compared to the ranking
achieved by the systems on the original TREC assessments. Kendall’s tau
was computed for this rank comparison. The Kendall’s tau correlation
coefficients between workers and TREC assessors are shown in Table 1.
Figure 2 presents the system rankings.
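
The aggregation and rank-comparison steps described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the tie-breaking rule in `majority_vote` (ties count as not relevant) is hypothetical, `kendall_tau` computes the tau-a variant without tie correction, and the system names and scores are invented.

```python
from collections import Counter
from itertools import combinations


def majority_vote(votes, tie_label=0):
    """Aggregate several workers' labels for one topic-document pair."""
    counts = Counter(votes).most_common()
    # Assumption for this sketch: a tie between the top labels yields tie_label.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two dicts mapping system name to score."""
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two systems scored in both dicts")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Same sign means both score sets order the pair the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical MAP scores for four systems under two judgment sets.
trec_map = {"sysA": 0.30, "sysB": 0.25, "sysC": 0.20, "sysD": 0.10}
worker_map = {"sysA": 0.28, "sysB": 0.18, "sysC": 0.22, "sysD": 0.05}
tau = kendall_tau(trec_map, worker_map)
print(majority_vote([1, 1, 0]), round(tau, 2))
```

In this example only one of the six system pairs (sysB and sysC) is ordered differently by the two judgment sets, so tau is (5 - 1) / 6.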

As explained earlier, ten topics were chosen in this study. For a typical
system-based TREC experiment, the use of fifty topics has been suggested
to increase the reliability of the experimental results (Buckley and
Voorhees, 2000). However, average precision is considered to be a
reasonably stable and discriminating option for general purpose retrieval.
A plot was presented in (Buckley and Voorhees, 2000) showing the
average error rate over 100 trials (where each trial's error rate is the
average over the fifty permuted query sets) for each of the topic set sizes
smaller than fifty. For all measures, the average error rate decreases as
the number of topics increases. Precision at depth 10 has a relatively
high error rate, whereas mean average precision has a relatively
low error rate at small topic set sizes (Buckley and Voorhees, 2000). In
this study, we initially computed mean average precision at the evaluation
depth of 1000. However, with runs from all the systems that
participated in TREC-9 evaluated to that depth, the vast majority of
topic-document pairs that occur in any particular system run are likely to
be unjudged. Usually, mean average precision simply treats these
unjudged items as not relevant. This scenario could have a
substantial impact on the metric scores. Hence, to address this issue, we
also calculated mean average precision to a shallower depth of 10.
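
The effect of evaluation depth and unjudged documents can be made concrete with a small sketch. It assumes one common convention: average precision at a cut-off is normalised by the total number of judged-relevant documents, and documents missing from the judgments count as not relevant. The function names, document identifiers, and judgments are hypothetical.

```python
def average_precision_at(ranking, qrels, depth):
    """Average precision of one ranked list, truncated at the given depth.

    ranking: list of document ids in ranked order.
    qrels: dict mapping document id to relevance (greater than 0 means relevant).
    Documents absent from qrels are unjudged and treated as not relevant.
    """
    num_relevant = sum(1 for rel in qrels.values() if rel > 0)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking[:depth], start=1):
        if qrels.get(doc_id, 0) > 0:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / num_relevant


def mean_average_precision(run, qrels_by_topic, depth):
    """Mean of per-topic average precision over all judged topics."""
    topics = list(qrels_by_topic)
    total = sum(
        average_precision_at(run.get(topic, []), qrels_by_topic[topic], depth)
        for topic in topics
    )
    return total / len(topics)


# Hypothetical run and judgments; d4 and d5 are unjudged, d9 was never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
qrels = {"d1": 1, "d2": 0, "d3": 1, "d9": 1}
ap_at_5 = average_precision_at(ranking, qrels, depth=5)
ap_at_2 = average_precision_at(ranking, qrels, depth=2)
print(round(ap_at_5, 3), round(ap_at_2, 3))
```

Here the relevant documents at ranks 1 and 3 contribute precisions of 1 and 2/3, so average precision at depth 5 is (1 + 2/3) / 3, while at depth 2 only the first contributes. Unjudged documents lower the score exactly as judged non-relevant ones do, which is why a deep cut-off combined with a small judgment set can distort the metric.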


Workers                 Tau (depth = 10)    Tau (depth = 1000)
Group 1 (low score)     0.73                0.86
Group 2 (high score)    0.85                0.90


Table 1: Kendall’s tau between workers and TREC assessors. 
















Figure 2a: Mean average precision scores based on TREC assessors, Group 1
and Group 2 for depth 10

Figure 2b: Mean average precision scores based on TREC assessors, Group 1
and Group 2 for depth 1000

(In both figures the systems are sorted in ascending order of mean average precision scores 
generated using TREC assessors' judgments) 

There is a slightly higher correlation between the TREC rankings and those
produced using the Group 2 judgments (for depths 10 and 1000) than those
produced using the Group 1 judgments. This trend indicates that when the
assessments of workers with high verbal comprehension skill are used to
evaluate and rank retrieval systems by effectiveness, the resulting
rankings are relatively similar to those based on the official TREC
assessments. This finding is consistent with previous studies, which
reported that different judges have little effect on system
rankings. For instance, in a study investigating the effect of task design on
system rankings, the authors found that a full set of quality control
methods can lead to better system rankings, showing a high correlation
with the system rankings of the gold set. They also found that
removing low accuracy workers had a slight effect on system ranking
(Kazai, Kamps, Koolen and Milic-Frayling, 2011). In a separate study
(Trotman and Jenkinson, 2007), the Spearman’s rank correlation
between multiple judges and the gold set for sixty-four systems was 0.95.
They concluded that different judges have little effect on system
rankings. However, a number of studies found that the variation in
relevance judgments does not have an influence on system rankings (Lesk
and Salton, 1968; C. W. Cleverdon, 1970; Kazhdan, 1979; Burgin, 1992). In
a preliminary study (Voorhees, 2000), the relevance judgments of both
National Institute of Standards and Technology (NIST) judges and
University of Waterloo judges were compared for a TREC-6 dataset. The
Kendall’s tau correlation between these two groups was 0.896 for
seventy-six systems ranked by mean average precision. The study
concluded that the variation in relevance judgments rarely influences the
system rankings.

The effect of self-reported worker features on 
relevance judgment accuracy 

After judging the relevance of each topic and document, workers rated
their confidence in their evaluation using a 4-point Likert scale. Table 2
summarizes the accuracy for each level of confidence, across 1276
relevance judgments. The results showed that less confident workers
achieved lower accuracy rates for the relevance judgments, while
confident workers achieved higher accuracy rates. The result of the
chi-squared test for the relationship between confidence and accuracy was
significant (χ² = 20.05, p < 0.01).



Feature                       Level   Number of    Correct      Accuracy
                                      judgments    judgments
Confidence in judgment        1       44           21           0.47
                              2       170          103          0.60
                              3       530          368          0.69
                              4       532          391          0.73
Difficulty of the judgment    1       342          276          0.80
                              2       303          210          0.69
                              3       541          345          0.63
                              4       90           52           0.57
Knowledge of the topic        1       207          138          0.66
                              2       319          241          0.75
                              3       469          335          0.71
                              4       281          169          0.60

Ratings in Table 2 are based on a 4-point Likert-type
scale, ranging from level 1 (minimum) to level 4 (maximum)
for confidence in judgment, difficulty of the judgment, and
knowledge of the topic.


Table 2: Relationship between confidence in judgment, difficulty of the 
judgment, knowledge of the topic, and accuracy of judgments. 
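
The chi-squared statistics reported in this section can be checked against the counts in Table 2. The sketch below assumes a standard Pearson chi-squared test on a 4 x 2 table of correct versus incorrect judgments per rating level, which the text does not spell out. The function names are hypothetical; the counts are copied from Table 2.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count if accuracy were independent of the rating level.
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


def to_rows(counts):
    """Convert (number of judgments, correct judgments) to (correct, incorrect)."""
    return [(correct, total - correct) for total, correct in counts]


# (number of judgments, correct judgments) for levels 1 to 4, from Table 2.
TABLE_2 = {
    "confidence": [(44, 21), (170, 103), (530, 368), (532, 391)],
    "difficulty": [(342, 276), (303, 210), (541, 345), (90, 52)],
    "knowledge": [(207, 138), (319, 241), (469, 335), (281, 169)],
}

# Critical value of the chi-squared distribution for p = 0.01 with 3 degrees of freedom.
CRITICAL_P01_DF3 = 11.345

results = {name: chi_squared(to_rows(counts)) for name, counts in TABLE_2.items()}
for name, (statistic, df) in results.items():
    print(name, round(statistic, 2), df, statistic > CRITICAL_P01_DF3)
```

Under this assumption, the statistics recomputed from the table counts should come out close to the values reported in the text for confidence, difficulty and knowledge, each with three degrees of freedom and each above the 0.01 critical value.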
































Once workers had performed a relevance judgment evaluation of each
topic and document, they were asked to rate the level of difficulty of the
evaluation using a 4-point Likert scale. The accuracy was then calculated
for each level of difficulty to find out whether the difficulty level of a
judgment influences the accuracy of the performance. Table 2 shows the
accuracy for each level of difficulty. Workers who reported that a
task was difficult achieved lower accuracy, while workers who found a
task easy obtained higher accuracy. A chi-squared test for the relationship
between difficulty and accuracy was significant (χ² = 34.22, p < 0.01).

The workers were asked to rate their knowledge about a given topic, using
a 4-point Likert scale. Interestingly, the results showed that those
workers with an extreme level of self-reported knowledge (either low or
high) were less accurate than those who rated their
knowledge at a moderate level. The relationship is significant using the
chi-squared test for equality of proportions (χ² = 18.56, p < 0.01), even
though the relationship is not monotonic (Table 2). This trend
may conflict with what would generally be expected. There are
several possible explanations for this finding. Firstly, knowledge of the
topic was self-reported, and workers may have reported it in a way that
showed their work in a better light. Secondly,
the responses could reflect the workers’ attitude and confidence in their
tasks. The reason why workers with high self-reported knowledge are
apparently less reliable may be that incompetent workers have an inflated
sense of their own knowledge (Behrend et al., 2011); or, if self-reported
knowledge was accurate, it may be that those knowledgeable workers
were more opinionated, and for that reason more likely to disagree with
the original assessor on the relevance of an article to a topic. To
summarize, our results seem consistent with previous work which found
that knowledge of the topic did not influence the accuracy of relevance
judgments (Kazai et al., 2012), while contrasting with previous studies
which found that knowledge of the topic and the task plays an important
role in the accuracy of relevance judgments (Bailey et al., 2008; Kinney,
Huffman and Zhai, 2008).

Conclusion 

This article has presented the results of an experiment in which 
crowdsourced workers performed relevance assessments in the 
evaluation of information retrieval systems. Our objective was to explore 
the relationship between workers’ verbal comprehension skill and self- 
reported competence on the one hand, and assessment reliability on the 
other, where assessment reliability was measured as agreement with the 
original expert human assessors. We found a significant positive 
correlation between verbal comprehension skill and judgment reliability. 
Similarly, when the assessments of workers with high verbal 
comprehension skill were used to evaluate and rank retrieval systems by 
effectiveness, they gave a ranking more similar to that of the official 
assessments than when the assessments of workers with low verbal 
comprehension skill were used. 






The findings around self-reported competence were more mixed.
Workers who reported greater confidence in their assessments, and found
the task easier, gave more reliable judgments. Workers reporting high
knowledge of the topic, however, gave the least reliable judgments (or at
least, those least likely to agree with the original assessors), while the
more reliable workers were those reporting only moderate knowledge.
Whether this surprising finding is due to the self-confidence of
incompetence, or the opinionatedness of the capable, or to some other
effect requires further study. In any case, relying on self-declared
knowledge to gauge assessment reliability is clearly questionable, and
more objective measures of ability should be sought.

In summary, our findings show that verbal comprehension skill
influences the accuracy of the crowdsourced workers who create
relevance judgment sets. In the light of the findings, it is reasonable to
argue that certain worker characteristics can be used to predict accuracy
or to explain differences in accuracy between worker groups. Indeed,
identifying the relationship between the cognitive abilities of crowdsourced
workers and the reliability of relevance judgments can lead to new quality
control approaches to improve the reliability of relevance judgments.
However, as this experiment was conducted on a small dataset, in future
work we will investigate whether these findings remain stable on
a larger scale. The interesting result of the relationship between
self-reported competence (confidence and difficulty) and reliability of
relevance judgments motivates further investigation into using this
competence in quality control approaches.